(57) Abstract

The present invention is a method and apparatus that measures the
similarity of two images. Any information that can be discretely
symbolized can be transformed into an image through so-called "image
projection". This process is used to define otherwise discrete entities
as part of a linear space, making it possible to calculate distances
among those entities. A mechanism called a cluster allows association of
otherwise discrete symbols, improving the matching abilities of the
invention. Initially, the sequence of symbols is normalized (302). Then a
projection (304) of the normalized sequence is created. The projection
may optionally be generated with a cluster (305) that assigns weights to
the neighbors of a core symbol and/or with position weights (306) that
assign weights to each position in the normalized image. Projection
matching (310) is then performed to determine match candidates for the
string of symbols.

METHOD AND APPARATUS FOR COMPARISON
OF DATA STRINGS

FIELD OF THE INVENTION

This invention relates to the field of data comparison. 

BACKGROUND ART

In data storage systems or data base systems, it is often desired to retrieve
blocks of data in response to a query. In other cases, an unknown block of data
is compared with stored blocks of data as a means of identifying the unknown
block of data. In some cases, there is no stored block of data in the data base
that matches the query. Similarly, there may be no matching stored block of
data for a given unknown block of data. However, it may be useful to provide
information about the blocks of data that are closest to matching the query
block of data. This is particularly true in spell check programs, where a word is
misspelled and the most likely replacement word is to be determined. A
system for determining the best match for a particular block of data is known
as a word comparator, string matching scheme, or matching algorithm.

In the prior art, such matching is accomplished by relatively
straightforward algorithms that seek to identify common characters or symbols
between two strings. For example, a "left-to-right" comparison of two strings is
performed until common characters are found. The common characters are
then aligned and a "right-to-left" comparison is performed. This algorithm
only identifies typographic differences between two strings.

There are prior art patents that describe matching schemes that include
methods for determining the degree of similarity between two strings. Both
Parvin 4,698,751 and Parvin 4,845,610 describe a string to string matching
method in which a "distance" between the two strings is calculated. "Distance"
in Parvin is defined as the minimum number of editing operations (such as
adding a character, deleting a character and substituting for a character) needed
to convert one string to the other.
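
By way of illustration only, a distance of this kind (the minimum number
of single-character additions, deletions and substitutions) may be
computed with the well known dynamic programming recurrence sketched
below. The function name edit_distance and the MAXLEN limit are
assumptions made for this sketch; it is not taken from the Parvin
patents.

```c
#include <stddef.h>
#include <string.h>

#define MAXLEN 63   /* longest string this sketch accepts (assumption) */

/* Minimum number of insertions, deletions and substitutions needed to
 * convert string a into string b.  Returns (size_t)-1 if either string
 * is longer than MAXLEN. */
size_t edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b), i, j;
    size_t row[MAXLEN + 1];            /* one row of the distance table */

    if (la > MAXLEN || lb > MAXLEN)
        return (size_t)-1;
    for (j = 0; j <= lb; ++j)
        row[j] = j;                    /* "" -> b[0..j) needs j insertions */
    for (i = 1; i <= la; ++i) {
        size_t diag = row[0];          /* table[i-1][j-1] */
        row[0] = i;
        for (j = 1; j <= lb; ++j) {
            size_t up = row[j];                          /* table[i-1][j] */
            size_t best = diag + (a[i - 1] != b[j - 1]); /* match/substitute */
            if (up + 1 < best)
                best = up + 1;                           /* delete from a */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;                   /* insert into a */
            diag = up;
            row[j] = best;
        }
    }
    return row[lb];
}
```

For example, edit_distance("comunicate", "communicate") yields 1, since a
single added character converts one string to the other.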



Yu et al., U. S. Patent 4,760,523, describes a "fast search processor" for
searching for a predetermined pattern of characters. The processor includes
serially connected cells, each of which contains a portion of the pattern. The
character set being searched is sent in a serial fashion through the serially
connected cells. Match indicators record each match between the pattern in a
cell and the character stream flowing through the cell.

Hartzband et al., U. S. Patent 4,905,162, describes a method for
determining the similarity between objects having characteristics that are
specified on a reference table. Weights for each characteristic may be specified
by a user. A numerical value for the similarity between objects is calculated
based on an element by element comparison of each characteristic.

U. S. Patent 4,979,227 to Mittelbach et al. describes a method, in an
optical character recognition context, for recognizing a character string by
comparing the string to a lexicon of acceptable character strings. The best
matching character strings from the lexicon are selected and tested to see
whether substitutions that would convert the original string to the lexicon
string are permitted. An example of a permitted substitution would be
substituting an "l" for an "i", since these characters are similar in appearance.
The actual comparison process is not described in this patent.

Fujisawa et al., U. S. Patent 4,985,863, describes a document storage and
retrieval system in which both image and text files of the document are stored
in memory. The desired image file is selected by searching the associated text
file. The text file, which may be generated by optical character recognition
methods applied to the image files, contains special characters that indicate
ambiguous characters. Possible alternatives may be provided for an
ambiguous character. For example, if a character is recognized as being possibly
an "o" or an "a", both these characters are listed together with the special
characters indicating the existence of an ambiguity.

U. S. Patent 5,008,818 to Bocast describes a method and apparatus for
reconstructing altered data strings by comparing an unreconstructed string to
"vocabulary" strings. The comparison is done on a character by character basis
by moving pointers from the beginning to the end of the unreconstructed string,
one of the pointers indicating the character being compared, the second acting
as a counter for the number of correct comparisons. Under certain conditions
the comparison is also done from the back to the front of the string. A
"reconstruction index" indicating the similarity between the unreconstructed
string and the vocabulary string is calculated from the positions of the pointers.

U. S. Patent 5,060,143 to Lee describes a method for comparing strings of
characters by comparing a target string to sequential blocks of candidate strings.
By comparing the target string to sequential portions of the candidate strings,
rather than to the candidate string as a whole, performance is improved by
eliminating redundant comparisons. An early "time out" feature determines
early during the comparison process whether the candidate string can possibly
be a valid match. If not, the comparison to that candidate string is aborted and
a comparison to the first block of the next candidate string is begun.

SUMMARY OF THE PRESENT INVENTION

The present invention is a method and apparatus that measures the
similarity of two images. Any information that can be discretely symbolized
can be transformed into an image through so-called "image projection". This
process is used to define otherwise discrete entities as part of a linear space,
making it possible to calculate distances among those entities. A mechanism
called a cluster allows association of otherwise discrete symbols, improving the
matching abilities of the invention. Cluster tables are created that reflect
symbol relationships. By adjusting the cluster tables, the outcome of similarity
ranking can be controlled.

The invention is used to measure the similarity between two strings of
symbols. The invention generates scaled scores that represent the degree of
matching between two vectors. The invention can be used as a spelling
correction tool, a phonetic matching scheme, etc.

The process of image projection transforms a string into a real-valued
vector. When searching for best matches in a large space, projection vectors
can be used to create an index in the search space. With a proper indexing
method, the best matches for a query can be found in the same time as required
to search for an exact match.

The present invention operates in several steps. Initially, the sequence
of symbols is normalized. Then, a projection of the normalized sequence is
created. The projection may optionally be generated with a cluster that assigns
weights to the neighbors of a core symbol and/or with position weights that
assign weights to each position in the normalized image. Projection matching
is then performed to determine match candidates for the string of symbols.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a flow diagram of the operation of the present invention.

Figure 2 is a flow diagram illustrating the preferred embodiment of the
present invention.

Figure 3 is a block diagram illustrating the preferred embodiment of the
present invention.

Figure 4 is a block diagram of an example of a computer system for
implementing the present invention.

DETAILED DESCRIPTION OF THE INVENTION

A method and apparatus for comparing data strings is described. In
the following description, numerous specific details, such as
normalization values, weight values, etc., are described in order to provide a
more thorough description of the present invention. It will be apparent,
however, to one skilled in the art, that the present invention may be
practiced without these specific details. In other instances, well known
features have not been described in detail so as not to obscure the present
invention.

The present invention operates as follows:

1. Normalize sequence of symbols.
2. Create projection of normalized sequence. (Optionally with a
   cluster that assigns weights to the neighbors of a core symbol
   and/or with position weights that assign weights to each position
   in the normalized image.)
3. Perform projection matching.

The present invention may be implemented on any conventional or
general purpose computer system. An example of one embodiment of a
computer system for implementing this invention is illustrated in Figure 4.
A keyboard 410 and mouse 411 are coupled to a bi-directional system bus
419. The keyboard and mouse are for introducing user input to the
computer system and communicating that user input to CPU 413. The
computer system of Figure 4 also includes a video memory 414, main
memory 415 and mass storage 412, all coupled to bi-directional system bus
419 along with keyboard 410, mouse 411 and CPU 413. The mass storage 412
may include both fixed and removable media, such as magnetic, optical or
magneto-optical storage systems or any other available mass storage
technology. The mass storage may be shared on a network, or it may be
dedicated mass storage. Bus 419 may contain, for example, 32 address lines
for addressing video memory 414 or main memory 415. The system bus 419
also includes, for example, a 32-bit data bus for transferring data between
and among the components, such as CPU 413, main memory 415, video
memory 414 and mass storage 412. Alternatively, multiplexed data/address
lines may be used instead of separate data and address lines.

In the preferred embodiment of this invention, the CPU 413 is a 32-bit
microprocessor manufactured by Motorola, such as the 68030 or 68040, or Intel,
such as the 80386 or 80486. However, any other suitable microprocessor or
microcomputer may be utilized.

Main memory 415 is comprised of dynamic random access memory
(DRAM) and, in the preferred embodiment of this invention, comprises 8
megabytes of memory. More or less memory may be used without departing
from the scope of this invention. Video memory 414 is a dual-ported video
random access memory, and in this invention consists, for example, of 256
kbytes of memory. However, more or less video memory may be provided
as well.



One port of the video memory 414 is coupled to video multiplexer and
shifter 416, which in turn is coupled to video amplifier 417. The video
amplifier 417 is used to drive the cathode ray tube (CRT) raster monitor 418.
Video multiplexing shifter circuitry 416 and video amplifier 417 are well
known in the art and may be implemented by any suitable means. This
circuitry converts pixel data stored in video memory 414 to a raster signal
suitable for use by monitor 418. Monitor 418 is a type of monitor suitable for
displaying graphic images, and in the preferred embodiment of this invention,
has a resolution of approximately 1020 x 832. Other resolution monitors may
be utilized in this invention.

The computer system described above is for purposes of example
only. The present invention may be implemented in any type of computer
system or programming or processing environment.

A flow diagram illustrating the operation of the present invention
is shown in Figure 1. At step 101, the symbol sequence to be compared
is identified and prepared. This involves normalizing the sequence. At
step 102, a projection of the normalized sequence is generated. The
projection can be generated with one or both of cluster table 103 and
weight table 104.

At step 105, the output of step 102, a real-valued vector projection, is
compared to other vector projections in a projection matching step.

NORMALIZATION

A string of S symbols is stretched or compressed into a normalized
image of N symbols. The size of each symbol in the normalized image
represents its portion in the string. Suppose the string of symbols is the word
"lists", consisting of five letters or symbols (|S| = 5), as shown below:

| l | i | s | t | s |

Consider the case where the normalized number of symbols is eight, so
that N = 8. The medium of a symbol, M, in the normalized image is computed as
follows:

$$M(S_i) = i \cdot |N| / |S|$$

where $S_i$ is the i-th symbol in string S, |N| is the normalized size, and |S| is
the length of string S. The five symbols of the word "lists" are now stretched
into eight slots, with the medium of the i-th symbol being 1.6i (|N|/|S| = 8/5 =
1.6). The normalized size of each symbol is therefore 1.6 normal symbol slots.

Each symbol in the normalized string must have a unitary value.
Therefore "l" is placed in the first symbol slot, leaving 0.6l to be placed in the
second symbol slot, as shown below. To provide a unitary value for the second
symbol slot, 0.4i is added to 0.6l. This leaves 1.2i. 1.0i, or "i", is placed in the
third symbol slot, leaving 0.2i for the fourth symbol slot, and so on as shown
below. In summary, each symbol from the original string is represented by 1.6
times that symbol in the normalized string.



| l | 0.6l+0.4i | i | 0.2i+0.8s | 0.8s+0.2t | t | 0.4t+0.6s | s |
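
The slot arithmetic above can be sketched in C as follows. This is a
minimal illustration under stated assumptions, not code from the
preferred embodiment: the function name normalize_shares and the layout
of its output array are hypothetical, and shares are expressed in units
of 1/|S| of a slot so that only integer arithmetic is needed.

```c
#include <stddef.h>

/* Hypothetical helper.  share[n * slen + i] receives the portion of
 * normalized slot n that is occupied by symbol i when a string of slen
 * symbols is stretched or compressed to nslots slots.  Portions are in
 * units of 1/slen, so the entries of one slot always add up to slen. */
void normalize_shares(size_t slen, size_t nslots, unsigned *share)
{
    size_t n, i;

    for (n = 0; n < nslots; ++n) {
        for (i = 0; i < slen; ++i) {
            /* On a common axis of slen * nslots units, symbol i covers
             * [i * nslots, (i + 1) * nslots) and slot n covers
             * [n * slen, (n + 1) * slen).  The share is their overlap. */
            size_t lo = i * nslots > n * slen ? i * nslots : n * slen;
            size_t hi = (i + 1) * nslots < (n + 1) * slen
                      ? (i + 1) * nslots : (n + 1) * slen;
            share[n * slen + i] = hi > lo ? (unsigned)(hi - lo) : 0u;
        }
    }
}
```

For "lists" (slen = 5, nslots = 8) the second slot receives 3 units of
the first symbol and 2 units of the second, i.e. 0.6l + 0.4i, in
agreement with the table above.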



SYMBOL PROJECTION

A projection is a real-valued vector that is equally divided into as many
partitions as members in a symbol set. For example, the symbol set for a
spelling checker is the set of symbols of the alphabet, numeric characters, and
punctuation marks. Each partition is called a "closure" $C_i$ for its corresponding
symbol i. (A closure is larger than a normalized image.)

Each symbol in the image is projected onto its closure in a normal
distribution, with the maximum at its medium. A decreasing series D with length
|D| can be defined to simulate the normal distribution. D is called the distribution
series, and |D| the distribution size. The projection is computed as follows:

$$C_{S_i,\,M(S_i)+|D|+j} = d_j$$

$$C_{S_i,\,M(S_i)+|D|-j} = d_j \qquad (j = 0, 1, 2, \ldots, |D|)$$

where $d_j$ is the j-th item in distribution series D, and $C_{S_i,k}$ is the k-th item in symbol
$S_i$'s closure. If a symbol occurs more than once and its distributions overlap,
only the larger values are kept.

For example, with distribution series (4, 3, 1), whose length |D| is 3, the
closures for symbols L, I, S and T have a size of 12 (|N| + 2 * |D| - 2 = (8 + (2 * 3)) -
2 = 12) and are as follows:



[Figure: closures of the symbols "l", "i", "s" and "t", each 12 positions
wide, with the values 1, 3, 4, 3, 1 centered on each occurrence of the symbol.]



Note that because there are two instances of the letter "s" in the 
sequence, there are two peaks. Each peak corresponds to an occurrence of "s" 
in the normalized stream. 
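
A minimal C sketch of this simple projection step is given below. It is
an illustration only: the function name project_peak is hypothetical, and
the sketch assumes a zero-based normalized position x0 in the range
0 to |N| - 1 and a closure of |N| + 2 * |D| - 2 entries, so that the peak
is stored at offset x0 + |D| - 1 and both tails fit inside the closure.

```c
#include <stddef.h>

/* Hypothetical helper.  Projects one occurrence of a symbol onto that
 * symbol's closure.  dist[0] is the peak value and dist[1..dlen-1] are
 * the decreasing tail values.  Where distributions overlap, only the
 * larger value is kept. */
void project_peak(unsigned *closure, size_t x0,
                  const unsigned *dist, size_t dlen)
{
    size_t peak = x0 + (dlen - 1);   /* leave room for the left tail */
    size_t j;

    for (j = 0; j < dlen; ++j) {
        if (closure[peak - j] < dist[j])
            closure[peak - j] = dist[j];
        if (closure[peak + j] < dist[j])
            closure[peak + j] = dist[j];
    }
}
```

Calling project_peak twice on the same closure, once for each occurrence
of a repeated symbol, produces the two peaks described above.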

The preferred embodiment of the present invention utilizes one of the 
following two distribution tables: 



distribution table #1: { 21, 19, 17, 14, 10, 4, };
distribution table #2: { 17, 14, 10, 4, };



The preferred embodiment of the present invention uses the following 
encoding and decoding tables. 

Encoding Table: {
  31, 31, 31, 31, 31, 31, 31, 31,   /* 0-7 */
  31, 31, 31, 31, 31, 31, 31, 31,   /* 8-15 */
  31, 31, 31, 31, 31, 31, 31, 31,   /* 16-23 */
  31, 31, 31, 31, 31, 31, 31, 31,   /* 24-31 */
  31, 31, 31, 31, 31, 31, 31, 26,   /* 32-39 */
  31, 31, 31, 31, 31, 27, 28, 31,   /* 40-47 */
  31, 31, 31, 31, 31, 31, 31, 31,   /* 48-55 */
  31, 31, 31, 31, 31, 31, 31, 31,   /* 56-63 */
  31,  0,  1,  2,  3,  4,  5,  6,   /* 64-71 */
   7,  8,  9, 10, 11, 12, 13, 14,   /* 72-79 */
  15, 16, 17, 18, 19, 20, 21, 22,   /* 80-87 */
  23, 24, 25, 31, 31, 31, 31, 31,   /* 88-95 */
  31,  0,  1,  2,  3,  4,  5,  6,   /* 96-103 */
   7,  8,  9, 10, 11, 12, 13, 14,   /* 104-111 */
  15, 16, 17, 18, 19, 20, 21, 22,   /* 112-119 */
  23, 24, 25, 31, 31, 31, 31, 31,   /* 120-127 */
  31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,   /* 128-143 */
  31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,   /* 144-159 */
  31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,   /* 160-175 */
  31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,   /* 176-191 */
  31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,   /* 192-207 */
  31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,   /* 208-223 */
  31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,   /* 224-239 */
  31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31    /* 240-255 */
};

Decoding Table: {
  'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',   /* 0-7 */
  'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',   /* 8-15 */
  'q', 'r', 's', 't', 'u', 'v', 'w', 'x',   /* 16-23 */
  'y', 'z', '\'', '-', '.', 255, 255, 255   /* 24-31 */
};


PROJECTION WITH CLUSTERS



A cluster is a subset of the character set which contains a core character
and any number of neighbor characters. It is used to represent relationships
among characters. For example, the following cluster:

{ { MAP('a'), 8 }, { MAP('s'), 2 }, { MAP('e'), 1 }, { 0, 0 } }

indicates that 's' and 'e' are close to 'a', and that 's' is closer to 'a' than 'e' is.
The parameter clusters is an array of clusters; each cluster corresponds to a
character in the character set as the core. Note that MAP() is used above to
indicate that ASCII is usually not used for the comparison scheme of the
present invention, and ASCII characters are mapped into a smaller set. The
mapping function is used to reduce the size of the character set for memory
optimization. The memory space for the full ASCII set may not be available.
In addition, the actual symbols of interest, such as in a spelling checker, may be
fewer than in the entire ASCII set. Therefore, the characters can be mapped
into a smaller set. In one embodiment, characters are mapped into a space
from 0 to 32.
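
One possible form of such a mapping function is sketched below. The name
map_symbol is hypothetical. The sketch assumes an ASCII execution
character set and follows the encoding table given earlier: upper and
lower case letters map to 0 through 25, the apostrophe, hyphen and period
map to 26, 27 and 28, and every other character maps to 31.

```c
/* Hypothetical mapping function: ASCII character -> code in 0..31.
 * Code 31 stands for "not a symbol of interest". */
unsigned map_symbol(int c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned)(c - 'a');
    if (c >= 'A' && c <= 'Z')
        return (unsigned)(c - 'A');
    switch (c) {
    case '\'': return 26u;
    case '-':  return 27u;
    case '.':  return 28u;
    default:   return 31u;
    }
}
```

A table lookup, as in the preferred embodiment, gives the same result
without branching; the function form is used here only because it is
shorter to read.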

A cluster $U_i$ defines weights for neighbors of the core symbol i; $u_{ii}$ is the
weight of i itself. Every symbol is the core of its cluster. In simple projection, a
cluster has a core as its only symbol, and the weight for the core is 1.

When a cluster has more than one symbol, or the core symbol has
neighbors, the core symbol is projected not only to its own closure but also to its
neighbors' closures. The projection becomes:

$$C_{n,\,M(S_i)+|D|+j} = d_j \, u_{S_i,n}$$

$$C_{n,\,M(S_i)+|D|-j} = d_j \, u_{S_i,n} \qquad (j = 0, 1, 2, \ldots, |D|)$$

where n is a member of the cluster of $S_i$.
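
The cluster projection can be sketched in C as shown below. The type
cluster_entry and the function project_cluster are hypothetical names
chosen for this sketch, which assumes mapped symbol codes, a zero-based
normalized position x0, and a cluster list terminated by an entry of
weight 0.

```c
#include <stddef.h>

/* Hypothetical cluster entry: a member of the cluster and its weight.
 * An entry with weight 0 ends the list. */
struct cluster_entry {
    unsigned ch;
    unsigned weight;
};

/* Projects one occurrence of a core symbol onto the closure of every
 * member of its cluster.  projs holds one closure of closure_len
 * entries per symbol code; dist is the distribution series. */
void project_cluster(unsigned *projs, size_t closure_len, size_t x0,
                     const unsigned *dist, size_t dlen,
                     const struct cluster_entry *cluster)
{
    for (; cluster->weight != 0; ++cluster) {
        unsigned *closure = projs + cluster->ch * closure_len;
        size_t peak = x0 + (dlen - 1);
        size_t j;

        for (j = 0; j < dlen; ++j) {
            unsigned v = dist[j] * cluster->weight;   /* d_j * u */
            if (closure[peak - j] < v)
                closure[peak - j] = v;                /* keep larger value */
            if (closure[peak + j] < v)
                closure[peak + j] = v;
        }
    }
}
```

With a cluster for "c" that lists "c" with weight 8 and "k" with weight
4, an occurrence of "c" leaves a full-height peak in the closure of "c"
and a half-height peak in the closure of "k", and leaves the closure of
"g" untouched.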



Clusters are used to associate otherwise discrete symbols. The use of
clusters can provide a means to tailor queries to provide desired results. For
example, consider the word "communicate" and two misspellings of that
word, namely "communikate" and "communigate". It may be desirable to
implement the present invention so that "communikate" shows a higher
degree of matching than "communigate" (because the "k" sound is more like
the hard "c" sound of "communicate"). By including "k" in the cluster of "c",
the present invention can show that "communikate" is more likely to be
"communicate" than is "communigate".

The present invention implements clusters through cluster tables. An
example of a cluster table for the 26 lower case alpha characters is illustrated
below. Each letter is followed by paired values. The first value of each pair
represents the numerical designation of another letter in the cluster of the core
letter. The second number in each paired value represents the weight to be
given the cluster letter as a substitute for the core letter.

The first pair is the core letter itself and the value to be given when it is
a match. For example, for the letter "a", the first pair "{0, 8}" represents letter
"0" (i.e. "a") and its match weight of 8. Continuing with the letter "a", the next
pair is for the cluster letter "e", and its match weight of 4. The next two cluster
letters, "o" (letter 14) and "s" (letter 18), have match weights of 2. The fourth
cluster letter, "i", has a match weight of 4. For the letter "a", the letters "e" and
"i" are more often typed by mistake than the letters "o" and "s".

The cluster table values associated with each letter represent letters that
may have the same sound as a letter (e.g. "k" and hard "c", "s" and "c") or that
are near to each other on a standard "qwerty" keyboard, and are therefore
likely to be mis-struck. The following cluster table is given by way of example
only. Other cluster tables may be used without departing from the scope or
spirit of the present invention.

cluster table: {
a: {{0,8}, {4,4}, {14,2}, {18,2}, {8,4}, {0,0}},
b: {{1,8}, {21,2}, {13,2}, {3,2}, {0,0}},
c: {{2,8}, {18,4}, {10,4}, {23,2}, {21,2}, {25,2}, {0,0}},
d: {{3,8}, {18,2}, {5,2}, {1,2}, {0,0}},
e: {{4,6}, {0,3}, {8,3}, {14,2}, {22,2}, {17,2}, {20,2}, {0,0}},
f: {{5,8}, {21,4}, {6,2}, {3,2}, {15,4}, {7,4}, {0,0}},
g: {{6,8}, {9,4}, {5,2}, {7,2}, {0,0}},
h: {{7,8}, {5,4}, {6,2}, {9,2}, {0,0}},
i: {{8,8}, {24,4}, {4,3}, {14,2}, {20,2}, {0,4}, {0,0}},
j: {{9,8}, {6,4}, {10,2}, {7,2}, {0,0}},
k: {{10,8}, {2,4}, {23,4}, {16,4}, {9,2}, {11,2}, {0,0}},
l: {{11,8}, {17,2}, {10,2}, {0,0}},
m: {{12,8}, {13,4}, {0,0}},
n: {{13,8}, {12,2}, {1,2}, {0,0}},
o: {{14,8}, {20,2}, {4,3}, {0,2}, {8,3}, {15,2}, {0,0}},
p: {{15,8}, {5,4}, {14,2}, {0,0}},
q: {{16,8}, {10,4}, {22,2}, {0,0}},
r: {{17,8}, {11,2}, {4,2}, {19,2}, {0,0}},
s: {{18,8}, {2,4}, {25,4}, {23,4}, {0,2}, {3,2}, {0,0}},
t: {{19,8}, {17,2}, {24,2}, {0,0}},
u: {{20,8}, {14,2}, {8,2}, {4,2}, {22,4}, {0,0}},
v: {{21,8}, {22,4}, {5,4}, {1,2}, {2,2}, {0,0}},
w: {{22,8}, {21,4}, {16,2}, {4,2}, {20,4}, {0,0}},
x: {{23,8}, {10,4}, {18,4}, {25,2}, {2,2}, {0,0}},
y: {{24,8}, {20,2}, {19,2}, {8,4}, {0,0}},
z: {{25,8}, {18,4}, {23,2}, {2,2}, {0,0}},
(other characters' clusters)
   {{26,8}, {0,0}},
   {{27,8}, {0,0}},
   {{28,8}, {0,0}},
   {{29,8}, {0,0}},
   {{30,8}, {0,0}},
   {{31,8}, {0,0}}
}



The use of a cluster table is optional in the present invention. 

PROJECTION WITH POSITION WEIGHTS

In addition to, or instead of, the use of weights in clusters, weights, w,
can be assigned to each position in the normalized image. When a symbol is
projected in the image, the distribution value and cluster weight can be
multiplied with the weight associated with the symbol's position. Note that
the weights are assigned to the normalized image instead of the original string.

It is often the case that words are misspelled at the beginning rather than
in the middle or end. The position table can be used to indicate that the first
two positions of a word have twice the weight compared with others. Thus,
the first two characters in the string will have a more significant impact on the
similarity comparison. So, if the beginnings of two words are the same, they
are more likely to be the same word.



When using position weights with cluster tables, the projection
becomes:

$$C_{n,\,M(S_i)+|D|+j} = d_j \, w_{M(S_i)} \, u_{S_i,n}$$

$$C_{n,\,M(S_i)+|D|-j} = d_j \, w_{M(S_i)} \, u_{S_i,n} \qquad (j = 0, 1, 2, \ldots, |D|)$$

When using position weights without clusters, the projection becomes:

$$C_{S_i,\,M(S_i)+|D|+j} = d_j \, w_{M(S_i)}$$

$$C_{S_i,\,M(S_i)+|D|-j} = d_j \, w_{M(S_i)} \qquad (j = 0, 1, 2, \ldots, |D|)$$

The following code may be used to generate projections from a given
symbol string:

/*
 * NAME
 *   zfmprojs - Projection Matching: generate projections
 * DESCRIPTION
 *   generate projections from the string given and touch
 *   characters reached
 */
static eword
zfmprojs(pe_p, str, slen, projs)
reg0 zfmpenv *pe_p;
     text    *str;
     eword    slen;
     ub2     *projs;
{
  reg6  eword     pp;         /* count the positions in projection */
  reg1  ub2      *prjptrl;    /* pointers to go thru a proj */
  reg7  ub2      *prjptrr;
  reg3  ub2      *dptr;       /* pointer to go thru dist[] */
  reg2  eword     ss;         /* score for a position */
  reg8  zfmpclup  clstptr;    /* pointer to go thru a cluster */
  reg3  text      ch;         /* a char in the cluster */
  reg9  text      core;       /* core char */
  reg14 eword     score;      /* score for the char */
  reg10 eword     cc;         /* count the chars in string */
  reg11 eword     sum;        /* total score */
        eword     x0;         /* beginning of a distribution */

  /* following variables are copied from zfmpenv */
  reg13 eword     size;       /* size of the char set */
  reg15 eword     neighbors;  /* neighbors */
  reg12 ub2      *dist;       /* distribution */
        eword     closure;    /* size of the projection */
        eword     npos;       /* number of positions */
        ub2      *poswts;     /* pointer to the weight table */

  /* get info from the ZFMP structure */
  size      = pe_p->pe_size;
  neighbors = pe_p->pe_neighbors;
  closure   = pe_p->pe_closure;
  npos      = pe_p->pe_npos;
  dist      = pe_p->pe_dist;
  poswts    = pe_p->pe_poswts;

  /* initialize work areas */
  for (prjptrl = projs, pp = size * closure; pp; --pp, ++prjptrl)
  {
    *prjptrl = (ub2)0;
  }
  sum = (eword)0;               /* sum is accumulated */

  /* for each char (as a core) in the string */
  for (cc = (eword)1, ++slen; cc < slen; ++cc, ++str)
  {
    core = *str;

    /* check the range of the core */
    if (core >= size)
    {
      continue;
    }

    /* locate the char in our projection */
    if (!(cc) || (slen == 1))
    {
      x0 = (eword)0;            /* so that divided-by-0 won't happen */
    }
    else
    {
      x0 = cc * npos / slen;
    }

    /* get a cluster, for each char in the cluster, do ... */
    for (clstptr = (zfmpclup)pe_p->pe_clusters[core];
         clstptr->cl_sc;
         ++clstptr)
    {
      ch = clstptr->cl_ch;

      /* get the score and multiply the weight */
      score = (eword)clstptr->cl_sc * poswts[x0];

      /* The char is touched. First compute the
         points at the peak, then set prjptrl and
         prjptrr at the left and the right of
         the peak, respectively. */
      prjptrl = projs + ch * closure + x0 + neighbors;
      sum += *prjptrl = (ub2)(score * dist[0]);
      prjptrr = (prjptrl--) + 1;

      /* prjptrl and prjptrr are moving toward left
         and right, away from the peak. The positions
         they point to have the same score, so that
         ss is only calculated once. */
      for (pp = neighbors, dptr = dist + 1;
           pp;
           --pp, --prjptrl, ++prjptrr, ++dptr)
      {
        ss = score * (*dptr);   /* compute a score */

        /* I am not sure whether to accumulate
           points or to keep the highest one. */
#ifndef ZFMP_ACCUMULATE
        if (ss > *prjptrl)
        {
          sum += ss - *prjptrl;
          *prjptrl = (ub2)ss;
        }
        if (ss > *prjptrr)
        {
          sum += ss - *prjptrr;
          *prjptrr = (ub2)ss;
        }
#else
        sum += ss + ss;
        *prjptrl += (ub2)ss;
        *prjptrr += (ub2)ss;
#endif
      }
    }
  }

  return (sum);
}

PROJECTION MATCHING

After a projection is generated, whether it be a simple projection, a
projection with clusters, a projection with position weights, or a projection
with both clusters and position weights, a comparison of the model projection
and the query projection is made to determine the closeness of the match. The
comparison is accomplished using a similarity function.

A projection is a series of closures concatenated together. The similarity
function Θ of the preferred embodiment of the present invention is defined as
follows:

[equation not legible in the source]

where P1 and P2 are two projections to be compared. When two projections
are identical, or two original strings are identical, the similarity is 1. The
lowest possible Θ is 0.
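
The text requires only that the similarity be 1 for identical
projections and that its lowest value be 0. One function with these
properties is 1 - Σ|P1 - P2| / Σ(P1 + P2), sketched below in C. This
formula is an assumption adopted purely for illustration; it is not
presented as the similarity function of the preferred embodiment, and
the function name similarity is hypothetical.

```c
#include <stddef.h>

/* Assumed similarity between two projections of n entries each:
 * 1.0 when they are identical, 0.0 when they share no non-zero
 * position.  Two all-zero projections are treated as identical. */
double similarity(const unsigned *p1, const unsigned *p2, size_t n)
{
    unsigned long diff = 0, total = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        diff  += p1[i] > p2[i] ? p1[i] - p2[i] : p2[i] - p1[i];
        total += (unsigned long)p1[i] + p2[i];
    }
    return total ? 1.0 - (double)diff / (double)total : 1.0;
}
```

Because every entry of a projection is non-negative, the sum of absolute
differences never exceeds the sum of all entries, so the result always
lies between 0 and 1.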

Figure 2 is a flow diagram illustrating the preferred embodiment of the
present invention. At step 201, zfmpopen() is performed. zfmpopen() opens
an environment in which other functions operate. It returns a handle to the
open environment, and this handle is kept and passed to other functions to
refer to the same environment. poswts and dist are two 0-terminated integer
arrays. They are used to adjust the behavior of the comparison mechanism.
For example, the following setting:

int poswts[] = { 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 };
int dist[] = { 21, 20, 18, 15, 10, 6, 4, 2, 0 };

gives more priority to the beginning of a string and compensates models that
have their characters matched to nearby positions in the query. Usually poswts
is longer than most expected strings. The longer dist is, the more
compensation is given to matched characters at different positions. But dist is
not longer than poswts in the preferred embodiment of the present invention.
An extreme case is:

int poswts[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 };
int dist[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 };

which reduces the string matching to a letter frequency comparison. The
parameter clusters is an array of clusters; each cluster corresponds to a character
in the character set as the core. Code for implementing zfmpopen() is
illustrated in Appendix A.
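
As an illustration of how such 0-terminated arrays might be consumed,
the following C sketch counts their entries and derives a closure size
with the |N| + 2 * |D| - 2 rule given in the symbol projection section.
The function names zlen and closure_size are hypothetical, and taking
the length of poswts as the number of normalized positions is an
assumption of this sketch; Appendix A is not reproduced here.

```c
#include <stddef.h>

/* Number of entries that precede the terminating 0. */
size_t zlen(const int *a)
{
    size_t n = 0;

    while (a[n] != 0)
        ++n;
    return n;
}

/* Closure size implied by the two arrays, assuming |N| is the length
 * of poswts and |D| is the length of dist. */
size_t closure_size(const int *poswts, const int *dist)
{
    return zlen(poswts) + 2 * zlen(dist) - 2;
}
```

For the first setting above, poswts has 16 entries and dist has 8, so
the sketch yields a closure of 16 + 16 - 2 = 30 positions per symbol.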



At step 202, zfmpquery() is executed on a query 203. The query is processed,
a projection is calculated by calling zfmpprojs() at step 208, and the result is stored
in memory. zfmpquery() sets a new query for an environment. Once a query is
set, all comparisons made in the environment are based on this query. pe_h
indicates the environment, query is the query string, and qlen is the length of
query. Code for implementing zfmpquery() is illustrated in Appendix B.

At decision block 204, the test "Get model?" is made. If the
result is true, the system proceeds to step 205. If the result is false, there
are no more models, and the system proceeds to step 206, where the best
matches are displayed.

At step 205, zfmpmodel() is executed. A model is processed by calling
zfmprojs() at step 208. A projection is returned and compared to the
projection of the query. Code for implementing zfmpmodel() is illustrated in
Appendix C. At step 207, the similarity value for each model as compared to
the query is provided.

At step 208, zfmprojs() is used to normalize the input sequence and
calculate and return its projection. zfmprojs() may use cluster tables and/or
position weights as desired.

EXAMPLE 1

As noted, the present invention can be implemented without using
cluster tables or position weight tables. The invention can be implemented
using one or both of the cluster table or position weight tables if desired. In
this example the query is "communicate" and the model is "comunicate" and
no cluster table or position weight table is used.

Query: Communicate
Model: Comunicate
degree(maximum is 17) 16
similarity: 94.117645%:

query projection(a): 0 0 0 0 0 0 0 4 10 14 17 14 10 4 0 0
model projection(a): 0 0 0 0 0 0 0 4 10 14 17 14 10 4 0 0

query projection(c): 4 10 14 17 14 10 4 10 14 17 14 10 4 0 0 0
model projection(c): 4 10 14 17 14 10 4 10 14 17 14 10 4 0 0 0

query projection(e): 0 0 0 0 0 0 0 0 0 4 10 14 17 14 10 4
model projection(e): 0 0 0 0 0 0 0 0 0 4 10 14 17 14 10 4

query projection(i): 0 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0
model projection(i): 0 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0

query projection(m): 0 0 4 10 14 17 17 14 10 4 0 0 0 0 0 0
model projection(m): 0 0 4 10 14 17 14 10 4 0 0 0 0 0 0 0

query projection(n): 0 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0
model projection(n): 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0 0

query projection(o): 0 4 10 14 17 14 10 4 0 0 0 0 0 0 0 0
model projection(o): 0 4 10 14 17 14 10 4 0 0 0 0 0 0 0 0

query projection(t): 0 0 0 0 0 0 0 0 4 10 14 17 14 10 4 0
model projection(t): 0 0 0 0 0 0 0 0 4 10 14 17 14 10 4 0

query projection(u): 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0 0
model projection(u): 0 0 0 4 10 14 17 14 10 4 0 0 0 0 0 0


Note that because cluster tables are not used, only the letters of the
model (namely, a, c, e, i, m, n, o, t, and u) are used in the comparison. There
are two peaks for the letter "c" because it is the first letter and the eighth letter
in "communicate". There are two peaks for "m" in the query "communicate"
but only one for the misspelled model "comunicate".
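
The rows of Example 1 can be reproduced, for illustration, by the following sketch of normalization, projection and comparison. The names project, degree, NPOS, NEIGHBORS and DIST are hypothetical; the code of zfmprojs() itself is not reproduced in this document. The series { 17, 14, 10, 4 }, the rounding of i*|N|/|S| to the nearest position, and the rule that overlapping peaks keep the larger value are assumptions inferred from the printed rows, not statements of the actual implementation.

```c
#include <string.h>

#define NSYM      26                      /* lowercase letters only (assumed) */
#define NPOS      10                      /* normalized size |N| (assumed) */
#define NEIGHBORS 3                       /* distribution size |D| (assumed) */
#define CLOSURE   (NPOS + 2 * NEIGHBORS)  /* 16 slots per letter */

static const int DIST[NEIGHBORS + 1] = { 17, 14, 10, 4 };

/* Fill proj[letter][slot] for string s and return the sum of all slots. */
static long project(const char *s, int proj[NSYM][CLOSURE])
{
    int len = (int)strlen(s);
    long sum = 0;

    memset(proj, 0, sizeof(int) * NSYM * CLOSURE);
    for (int i = 0; i < len; ++i) {
        if (s[i] < 'a' || s[i] > 'z')
            continue;                          /* sketch handles a-z only */
        int m = (i * NPOS + len / 2) / len;    /* rounded i*|N|/|S| */
        if (m > NPOS - 1)
            m = NPOS - 1;                      /* keep the peak inside the closure */
        for (int j = -NEIGHBORS; j <= NEIGHBORS; ++j) {
            int v = DIST[j < 0 ? -j : j];
            int *slot = &proj[s[i] - 'a'][m + NEIGHBORS + j];
            if (v > *slot)
                *slot = v;                     /* overlapping peaks keep the larger value */
        }
    }
    for (int a = 0; a < NSYM; ++a)
        for (int k = 0; k < CLOSURE; ++k)
            sum += proj[a][k];
    return sum;
}

/* Integer degree of similarity in [0, maxsim], in the manner of Appendix C. */
static int degree(const char *query, const char *model, int maxsim)
{
    static int q[NSYM][CLOSURE], m[NSYM][CLOSURE];
    long sigma = project(query, q) + project(model, m);
    long delta = 0;

    for (int a = 0; a < NSYM; ++a)
        for (int k = 0; k < CLOSURE; ++k)
            delta += q[a][k] > m[a][k] ? q[a][k] - m[a][k] : m[a][k] - q[a][k];
    return sigma ? (int)((sigma - delta) * maxsim / sigma) : 0;
}
```

With these assumptions, degree("communicate", "comunicate", 17) evaluates to 16, the degree printed in Example 1, and the row computed for "m" in the query shows the two adjacent peaks discussed above.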

EXAMPLE 2

Example two illustrates a situation where a cluster table is used but the
position weight table is not used. In this example, distribution table number 2
is used.

query: communicate
model: comunicate

degree(maximum is 136) 131
similarity: 96.323532%:



query projection(a): 0 8 20 28 34 28 40 56 80 112 136 112 51 42 30 12
model projection(a): 0 8 20 28 34 28 40 56 80 112 136 112 51 42 30 12

query projection(b): 0 0 0 0 0 8 20 28 34 28 20 8 0 0 0 0
model projection(b): 0 0 0 0 8 20 28 34 28 20 8 0 0 0 0 0

query projection(c): 32 80 112 136 112 80 32 80 112 136 112 80 32 0 0 0
model projection(c): 32 80 112 136 112 80 32 80 112 136 112 80 32 0 0 0

query projection(e): 0 12 30 42 51 42 30 42 51 56 68 84 102 84 60 24
model projection(e): 0 12 30 42 51 42 34 42 51 56 68 84 102 84 60 24

query projection(i): 0 12 30 42 51 42 80 112 136 112 68 56 51 42 30 12
model projection(i): 0 12 30 42 51 42 80 112 136 112 68 56 51 42 30 12

query projection(k): 16 40 56 68 56 40 16 40 56 68 56 40 16 0 0 0
model projection(k): 16 40 56 68 56 40 16 40 56 68 56 40 16 0 0 0

query projection(m): 0 0 32 80 112 136 136 112 34 32 20 8 0 0 0 0
model projection(m): 0 0 32 80 112 136 112 34 32 20 8 0 0 0 0 0

query projection(n): 0 0 16 40 56 68 80 112 136 112 80 32 0 0 0 0
model projection(n): 0 0 16 40 56 80 112 136 112 80 32 0 0 0 0 0

query projection(o): 0 32 80 112 136 112 80 34 34 28 34 28 34 28 20 8
model projection(o): 0 32 80 112 136 112 34 32 34 28 34 28 34 28 20 8

query projection(p): 0 8 20 28 34 28 20 8 0 0 0 0 0 0 0 0
model projection(p): 0 8 20 28 34 28 20 8 0 0 0 0 0 0 0 0

query projection(r): 0 0 0 0 0 0 0 0 8 20 28 34 34 28 20 8
model projection(r): 0 0 0 0 0 0 0 0 8 20 28 34 34 28 20 8

query projection(s): 16 40 56 68 56 40 16 40 56 68 34 40 20 8 0 0
model projection(s): 16 40 56 68 56 40 16 40 56 68 34 40 20 8 0 0

query projection(t): 0 0 0 0 0 0 0 0 32 80 112 136 112 80 32 0
model projection(t): 0 0 0 0 0 0 0 0 32 80 112 136 112 80 32 0

query projection(u): 0 8 20 28 34 80 112 136 34 80 32 28 34 28 20 8
model projection(u): 0 8 20 32 80 112 136 112 34 32 20 28 34 28 20 8

query projection(v): 8 20 28 34 28 20 8 20 28 34 28 20 8 0 0 0
model projection(v): 8 20 28 34 28 20 8 20 28 34 28 20 8 0 0 0

query projection(w): 0 0 0 0 16 40 56 68 56 40 20 28 34 28 20 8
model projection(w): 0 0 0 16 40 56 68 56 40 16 20 28 34 28 20 8

query projection(x): 8 20 28 34 28 20 8 20 28 34 28 20 8 0 0 0
model projection(x): 8 20 28 34 28 20 8 20 28 34 28 20 8 0 0 0

query projection(y): 0 0 0 0 0 16 40 56 68 56 40 34 28 20 8 0
model projection(y): 0 0 0 0 0 16 40 56 68 56 40 34 28 20 8 0

query projection(z): 8 20 28 34 28 20 8 20 28 34 28 20 8 0 0 0
model projection(z): 8 20 28 34 28 20 8 20 28 34 28 20 8 0 0 0

In this example, additional letters are tested because the cluster table is
chosen. For example, the letter "a" is a core letter for the letters "e", "o", "s",
and "i". The letter "c" is a core letter for the letters "s", "k", "x", "v", and "z".
Therefore, the additional letters "b", "k", "p", "r", "s", "v", "w", "x", "y", and
"z" are analyzed in addition to the letters in "communicate".

Referring to the results above, there are two peaks for the letter "k",
corresponding to the positions of the letter "c" in "communicate". This is because
"k" is in the cluster of the letter "c". There are also two peaks for each of the
letters "s", "v", "x", and "z", all cluster letters of the letter "c". The peaks for these
letters are smaller than that for the letter "k" because their match weight is lower.
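
The way a core letter spreads weighted peaks onto its cluster letters can be sketched as follows. The sketch models only a partial, hypothetical cluster for the core "c"; the weights 8, 4 and 2 are read off the peaks printed in Example 2 (136, 68 and 34, each a multiple of 17) and are assumptions, as are the names member, C_CLUSTER and project_core_c and the rounding and overlap rules.

```c
#include <string.h>

#define NSYM      26
#define NPOS      10                      /* normalized size (assumed) */
#define NEIGHBORS 3                       /* distribution size (assumed) */
#define CLOSURE   (NPOS + 2 * NEIGHBORS)

static const int DIST[NEIGHBORS + 1] = { 17, 14, 10, 4 };

/* One cluster member: a symbol and its match weight. */
struct member { char sym; int weight; };

/* Hypothetical, partial cluster whose core is 'c'. */
static const struct member C_CLUSTER[] = { { 'c', 8 }, { 'k', 4 }, { 'v', 2 } };

/* Project only the occurrences of 'c' in s onto every member of its cluster. */
static void project_core_c(const char *s, int proj[NSYM][CLOSURE])
{
    int len = (int)strlen(s);

    memset(proj, 0, sizeof(int) * NSYM * CLOSURE);
    for (int i = 0; i < len; ++i) {
        if (s[i] != 'c')
            continue;
        int m = (i * NPOS + len / 2) / len;   /* rounded i*|N|/|S| */
        if (m > NPOS - 1)
            m = NPOS - 1;
        for (size_t c = 0; c < sizeof C_CLUSTER / sizeof C_CLUSTER[0]; ++c) {
            int *row = proj[C_CLUSTER[c].sym - 'a'];
            for (int j = -NEIGHBORS; j <= NEIGHBORS; ++j) {
                int v = DIST[j < 0 ? -j : j] * C_CLUSTER[c].weight;
                int slot = m + NEIGHBORS + j;
                if (v > row[slot])
                    row[slot] = v;            /* overlapping peaks keep the larger value */
            }
        }
    }
}
```

For "communicate" this yields the rows printed in Example 2 for "c", "k" and "v": two peaks each, at the positions of the two occurrences of "c", with heights 136, 68 and 34 respectively.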



EXAMPLE 3

This example uses the position weight table, but not the cluster table.
The position weight table is given by: { 2, 2, 1, 1, 1, 1, 1, 1, 1, 1 }. This means that
the first two characters are given twice the weight of the remaining characters.
This is because most spelling mistakes are made at the beginning of a word as
opposed to the middle or end of a word. Distribution table number 2 is used.
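
The effect of this position weight table on a single letter can be sketched as follows. The function project_letter and the constants are hypothetical names; the series { 17, 14, 10, 4 }, the rounding of the normalized position, and the rule that overlapping peaks keep the larger value are assumptions inferred from the printed rows.

```c
#include <string.h>

#define NPOS      10                      /* normalized size (assumed) */
#define NEIGHBORS 3                       /* distribution size (assumed) */
#define CLOSURE   (NPOS + 2 * NEIGHBORS)

static const int DIST[NEIGHBORS + 1] = { 17, 14, 10, 4 };
static const int POSWTS[NPOS] = { 2, 2, 1, 1, 1, 1, 1, 1, 1, 1 };

/* Build the projection row of one letter of s, scaling each peak by the
   weight of the normalized position at which the letter occurs. */
static void project_letter(const char *s, char letter, int row[CLOSURE])
{
    int len = (int)strlen(s);

    memset(row, 0, sizeof(int) * CLOSURE);
    for (int i = 0; i < len; ++i) {
        if (s[i] != letter)
            continue;
        int m = (i * NPOS + len / 2) / len;   /* rounded i*|N|/|S| */
        if (m > NPOS - 1)
            m = NPOS - 1;
        for (int j = -NEIGHBORS; j <= NEIGHBORS; ++j) {
            int v = DIST[j < 0 ? -j : j] * POSWTS[m];
            int slot = m + NEIGHBORS + j;
            if (v > row[slot])
                row[slot] = v;                /* overlapping peaks keep the larger value */
        }
    }
}
```

For "communicate", the first "c" falls on position 0 and is doubled to a peak of 34, while the second "c" falls on position 6 and keeps a peak of 17.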



query: communicate
model: comunicate

degree(maximum is 34) 32
similarity: 94.117645%:

query projection(a): 0 0 0 0 0 0 0 4 10 14 17 14 10 4 0 0
model projection(a): 0 0 0 0 0 0 0 4 10 14 17 14 10 4 0 0

query projection(c): 8 20 28 34 28 20 8 10 14 17 14 10 4 0 0 0
model projection(c): 8 20 28 34 28 20 8 10 14 17 14 10 4 0 0 0

query projection(e): 0 0 0 0 0 0 0 0 0 4 10 14 17 14 10 4
model projection(e): 0 0 0 0 0 0 0 0 0 4 10 14 17 14 10 4

query projection(i): 0 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0
model projection(i): 0 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0

query projection(m): 0 0 4 10 14 17 17 14 10 4 0 0 0 0 0 0
model projection(m): 0 0 4 10 14 17 14 10 4 0 0 0 0 0 0 0

query projection(n): 0 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0
model projection(n): 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0 0

query projection(o): 0 8 20 28 34 28 20 8 0 0 0 0 0 0 0 0
model projection(o): 0 8 20 28 34 28 20 8 0 0 0 0 0 0 0 0

query projection(t): 0 0 0 0 0 0 0 0 4 10 14 17 14 10 4 0
model projection(t): 0 0 0 0 0 0 0 0 4 10 14 17 14 10 4 0

query projection(u): 0 0 0 0 4 10 14 17 14 10 4 0 0 0 0 0
model projection(u): 0 0 0 4 10 14 17 14 10 4 0 0 0 0 0 0

The influence of the position weight table is seen in that the peaks for
the first two letters, namely "c" and "o", are twice that of the peaks for the
remaining letters (34 versus 17). Also note that the second occurrence of the letter "c"
has only a peak of 17, versus the peak of 34 of the first occurrence.

EXAMPLE 4


This example uses both a cluster table and a position weight table. The effect of
the cluster table is shown by the additional cluster letters that are analyzed.
The effect of the position weight table is shown for the letters "c" and "o",
where the peak values are twice as high as for the other letters (272 versus 136). In
addition, the first peaks for the cluster letters for "c" ("s", "k", "x", "v", and
"z") are twice as high as the second peaks, illustrating the effect of the position
weight table. The peaks for the cluster letters for "o" ("u", "v", "a", "i", and "p")
are higher due to the position weight table.

query: communicate
model: comunicate

degree(maximum is 272) 263
similarity: 96.691177%:


query projection(a): 0 16 40 56 68 56 40 56 80 112 136 112 51 42 30 12
model projection(a): 0 16 40 56 68 56 40 56 80 112 136 112 51 42 30 12

query projection(b): 0 0 0 0 0 8 20 28 34 28 20 8 0 0 0 0
model projection(b): 0 0 0 0 8 20 28 34 28 20 8 0 0 0 0 0

query projection(c): 64 160 224 272 224 160 64 80 112 136 112 80 32 0 0 0
model projection(c): 64 160 224 272 224 160 64 80 112 136 112 80 32 0 0 0

query projection(e): 0 24 60 84 102 84 60 42 51 56 68 84 102 84 60 24
model projection(e): 0 24 60 84 102 84 34 42 51 56 68 84 102 84 60 24

query projection(i): 0 24 60 84 102 84 80 112 136 112 68 56 51 42 30 12
model projection(i): 0 24 60 84 102 84 80 112 136 112 68 56 51 42 30 12

query projection(k): 32 80 112 136 112 80 32 40 56 68 56 40 16 0 0 0
model projection(k): 32 80 112 136 112 80 32 40 56 68 56 40 16 0 0 0

query projection(m): 0 0 32 80 112 136 136 112 34 32 20 8 0 0 0 0
model projection(m): 0 0 32 80 112 136 112 34 32 20 8 0 0 0 0 0

query projection(n): 0 0 16 40 56 68 80 112 136 112 80 32 0 0 0 0
model projection(n): 0 0 16 40 56 80 112 136 112 80 32 0 0 0 0 0

query projection(o): 0 64 160 224 272 224 160 34 34 28 34 28 34 28 20 8
model projection(o): 0 64 160 224 272 224 34 64 34 28 34 28 34 28 20 8



WO 93/18484 PCT/US93/02179 


query projection(p): 0 16 40 56 68 56 40 16 0 0 0 0 0 0 0 0
model projection(p): 0 16 40 56 68 56 40 16 0 0 0 0 0 0 0 0

query projection(r): 0 0 0 0 0 0 0 0 8 20 28 34 34 28 20 8
model projection(r): 0 0 0 0 0 0 0 0 8 20 28 34 34 28 20 8

query projection(s): 32 80 112 136 112 80 32 40 56 68 34 40 20 8 0 0
model projection(s): 32 80 112 136 112 80 32 40 56 68 34 40 20 8 0 0

query projection(t): 0 0 0 0 0 0 0 0 32 80 112 136 112 80 32 0
model projection(t): 0 0 0 0 0 0 0 0 32 80 112 136 112 80 32 0

query projection(u): 0 16 40 56 68 80 112 136 34 80 32 28 34 28 20 8
model projection(u): 0 16 40 56 80 112 136 112 34 32 20 28 34 28 20 8

query projection(v): 16 40 56 68 56 40 16 20 28 34 28 20 8 0 0 0
model projection(v): 16 40 56 68 56 40 16 20 28 34 28 20 8 0 0 0

query projection(w): 0 0 0 0 16 40 56 68 56 40 20 28 34 28 20 8
model projection(w): 0 0 0 16 40 56 68 56 40 16 20 28 34 28 20 8

query projection(x): 16 40 56 68 56 40 16 20 28 34 28 20 8 0 0 0
model projection(x): 16 40 56 68 56 40 16 20 28 34 28 20 8 0 0 0

query projection(y): 0 0 0 0 0 16 40 56 68 56 40 34 28 20 8 0
model projection(y): 0 0 0 0 0 16 40 56 68 56 40 34 28 20 8 0

query projection(z): 16 40 56 68 56 40 16 20 28 34 28 20 8 0 0 0
model projection(z): 16 40 56 68 56 40 16 20 28 34 28 20 8 0 0 0

EXAMPLE 5



This example illustrates a comparison of "communicate" and
"comunicate" without a cluster table and without position weights, but using
distribution table number 1.

query: communicate
model: comunicate

degree(maximum is 21) 20
similarity: 95.238098%:

query projection(a): 0 0 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0
model projection(a): 0 0 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0

query projection(c): 4 10 14 17 19 21 19 17 14 17 19 21 19 17 14 10 4 0 0 0
model projection(c): 4 10 14 17 19 21 19 17 14 17 19 21 19 17 14 10 4 0 0 0

query projection(e): 0 0 0 0 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4
model projection(e): 0 0 0 0 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4

query projection(i): 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0
model projection(i): 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0

query projection(m): 0 0 4 10 14 17 19 21 21 19 17 14 10 4 0 0 0 0 0 0
model projection(m): 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0 0 0 0

query projection(n): 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0
model projection(n): 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0 0

query projection(o): 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0 0 0 0 0
model projection(o): 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0 0 0 0 0

query projection(t): 0 0 0 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0
model projection(t): 0 0 0 0 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0

query projection(u): 0 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0 0
model projection(u): 0 0 0 4 10 14 17 19 21 19 17 14 10 4 0 0 0 0 0 0

EFFECTS OF OPTIONAL TABLES 



If a word such as "communicate" is misspelled as "comunicate",
generally we may say that these two words have 1 character different out of 11
characters. Thus, "comunicate" is 91% similar to "communicate". However,
the actual similarity of the two words is higher. Using the present invention,
with cluster tables, position weights and distribution table number 1, the
similarity becomes approximately 97%.

Now compare the result of comparing "communicate" with
"communikate" and "communigate". With the cluster table above, a better
similarity results when "communicate" is compared with "communikate"
than when it is compared with "communigate" (94.3% vs 91.7%). This means that
"communikate" is more likely to be "communicate" than "communigate" is,
since "k" and "c" sometimes have the same pronunciation, and "k" is in the
cluster of "c". With the position weight table, a better similarity (94.3% vs
92.2%) is achieved while comparing "communicate" with "communikate".

A block diagram of the preferred embodiment of the present invention
is illustrated in Figure 3. A query string 301 is provided to normalizing block
302. Model vectors from model storage 313 are provided to normalizing block
302 on line 311. The normalizing block 302 normalizes the data string of S
symbols into a normalized image of N symbols. The normalized image 303A
of the query and the normalized image 303B of the model vector are provided
to the projection generating block 304.

A first memory means 305 for storing a cluster table is switchably
coupled to projection generating means 304 through switch 307. A second
memory means 306 for storing a position weight table is switchably coupled to
projection generating means 304 through switch 308. Switches 307 and 308 can
be independently controlled so that the projection vector 309 generated by
projection generating block 304 may optionally include the effect of the cluster
table 305 and/or the position weight table 306.

The projection vector 309A of the normalized query 303A and the
projection vector 309B of the normalized model vector 303B are provided to
projection matching block 310. The projection matching block 310 generates a
similarity value 312, representing the degree of similarity between the
projection vector 309A of the query and the projection vector 309B of the
model vector. The projection matching block 310 operates in accordance with
the algorithm:

$$\theta = \frac{\sum P_1 + \sum P_2 - \sum \lvert P_1 - P_2 \rvert}{\sum P_1 + \sum P_2}$$

where $P_1$ and $P_2$ are two projections to be compared. When two projections
are identical, or two original strings are identical, the similarity is 1. The
lowest possible $\theta$ is 0.
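
For illustration, the comparison performed by the projection matching block can be sketched as an integer computation in the manner of Appendix C, where the result is scaled by a maximum similarity value. The function name similarity_degree and its signature are hypothetical, and returning 0 for two empty projections is an assumption of this sketch.

```c
#include <stddef.h>

/* Return (sigma - delta) * maxsim / sigma, where sigma is the total of both
   projections and delta is the sum of their element-wise differences. */
static unsigned similarity_degree(const unsigned short *p1,
                                  const unsigned short *p2,
                                  size_t n, unsigned maxsim)
{
    unsigned long sigma = 0;
    unsigned long delta = 0;

    for (size_t k = 0; k < n; ++k) {
        sigma += (unsigned long)p1[k] + p2[k];
        delta += p1[k] > p2[k] ? (unsigned long)(p1[k] - p2[k])
                               : (unsigned long)(p2[k] - p1[k]);
    }
    if (sigma == 0)
        return 0;   /* assumption: two empty projections score 0 */
    return (unsigned)((sigma - delta) * maxsim / sigma);
}
```

Identical projections give delta = 0 and therefore the full value maxsim, while projections that share no non-zero slot give delta = sigma and therefore 0, matching the bounds stated above.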

The first, second, and third memory means of Figure 3 can be 
implemented as three address regions in a single memory. In addition, the 
apparatus of Figure 3 can be implemented on a processor as a plurality of 
processor executable instructions. 

Thus, a method and apparatus for comparing data strings has been
described.

APPENDIX A

/*
 * zfmpopen - Projection Matching: open a ZFMP structure
 * DESCRIPTION
 *   allocate and initialize zfmpenv structure
 */
zfmpref *
zfmpopen(size, maxsim, poswts, dist, clusters)
reg6  eword     size;
reg12 eword     maxsim;
reg8  ub2      *poswts;
reg7  ub2      *dist;
reg13 zfmpclut *clusters;
{
    reg0 zfmpenv *pe_p;   /* pointer to return */
    reg1 eword    i;

    /* following variables are calculated from the parameters */
    reg4  eword neighbors;
    reg5  eword closure;
    reg10 eword npos;

    /* We use array indexes instead of pointers, because we don't
       want to destroy dist and poswts. The overhead is minor
       since zfmpopen is only called once for each session. */

    for (i = 0; dist[i]; ++i) ;     /* how many neighbors */
    neighbors = i - 1;

    for (i = 0; poswts[i]; ++i) ;   /* how many positions */
    closure = i + neighbors * 2;
    npos = i;

#ifdef DEBUG
    printf("neighbors = %d, closure = %d, npos = %d\n",
           neighbors, closure, npos);
    printf("dist[] : ");
    for (i = 0; i < neighbors + 1; ++i) printf("%d ", dist[i]);
    printf("\nposwts[] : ");
    for (i = 0; i < npos; ++i) printf("%d ", poswts[i]);
    printf("\n");
#endif

    /* allocate ZFMP environment */
    if (!(pe_p = (zfmpenv *)malloc(sizeof(zfmpenv))))
    {
        return ((zfmpref *)0);
    }

    pe_p->pe_size      = size;
    pe_p->pe_maxsim    = maxsim;
    pe_p->pe_closure   = closure;
    pe_p->pe_neighbors = neighbors;
    pe_p->pe_npos      = npos;
    pe_p->pe_dist      = dist;
    pe_p->pe_poswts    = poswts;
    pe_p->pe_clusters  = clusters;
    pe_p->pe_qprojs    = 0;
    pe_p->pe_mprojs    = 0;

    /* allocate memory */
    if (!(pe_p->pe_qprojs = (ub2 *)malloc(sizeof(ub2) * size * closure))
        ||
        !(pe_p->pe_mprojs = (ub2 *)malloc(sizeof(ub2) * size * closure)))
    {
        zfmpclose((zfmpref *)pe_p);
        return ((zfmpref *)0);
    }

    /* cast and return */
    return ((zfmpref *)pe_p);
}

APPENDIX B

/*
 * NAME
 *   zfmpquery - Projection Matching: set a query
 * DESCRIPTION
 *   set the query to the string given and generate its projections
 */
void
zfmpquery(pe_h, query, qlen)
reg0 zfmpref *pe_h;
reg1 text    *query;
reg2 eword    qlen;
{
    /* do projections */
    zfmp_c(pe_h)->pe_qsum = zfmprojs(zfmp_c(pe_h), query,
                                     qlen,
                                     zfmp_c(pe_h)->pe_qprojs);

#ifdef DEBUG
    {
        int i, j;
        int qsum;

        qsum = 0;

        for (i = 0; i < zfmp_c(pe_h)->pe_size; ++i)
        {
            printf("%c: ", i + 'a');
            for (j = 0; j < zfmp_c(pe_h)->pe_closure; ++j)
            {
                printf("%d ", zfmp_c(pe_h)->pe_qprojs[i][j]);
                qsum += zfmp_c(pe_h)->pe_qprojs[i][j];
            }
            printf("\n");
        }
        printf("pe_qsum = %d, qsum = %d\n", zfmp_c(pe_h)->pe_qsum, qsum);
    }
#endif
}

APPENDIX C

/*
 * NAME
 *   zfmpmodel - compute the similarity index
 * DESCRIPTION
 *   generate projections for the model and compare the projections
 *   to those of query
 */
eword
zfmpmodel(pe_h, model, mlen)
reg0  zfmpref *pe_h;
reg11 text    *model;
reg12 eword    mlen;
{
    reg6 eword  i;
    reg7 ub4    sigma;     /* total of projections */
    reg9 eword  delta;     /* difference between two projections */
    reg8 ub2   *qprojs;    /* projections from zfmpenv */
    reg5 ub2   *mprojs;

    /* get pointers */
    qprojs = zfmp_c(pe_h)->pe_qprojs;
    mprojs = zfmp_c(pe_h)->pe_mprojs;

    /* do projections for the model and get the sigma */
    sigma = (ub4)zfmp_c(pe_h)->pe_qsum +
            zfmprojs(zfmp_c(pe_h), model,
                     mlen,
                     mprojs);

    /* calculate the difference */
    delta = (eword)0;
    for (i = zfmp_c(pe_h)->pe_size * zfmp_c(pe_h)->pe_closure;
         i;
         --i, ++qprojs, ++mprojs)
    {
        delta += *qprojs > *mprojs ? (eword)*qprojs - *mprojs :
                                     (eword)*mprojs - *qprojs;
    }

    return ((eword)((sigma - delta) * (zfmp_c(pe_h)->pe_maxsim) /
                    sigma));
}

CLAIMS

1. A method of comparing a first string of symbols with a second
string of symbols, said method comprising the steps of:

normalizing said first string to create a first normalized string;

generating a first projection from said first normalized string;

normalizing said second string to create a second normalized string;

generating a second projection from said second normalized string;

comparing said first projection and said second projection to determine
a degree of similarity of said first and second projections.

2. The method of claim 1 wherein said steps of generating said first
projection and said second projection include the use of cluster tables.

3. The method of claim 1 wherein said steps of generating said first
projection and said second projection include the use of position weight tables.

4. The method of claim 1 wherein said steps of generating said first
projection and said second projection include the use of cluster tables and
position weight tables.

5. The method of claim 1 wherein said step of normalizing said first
string comprises generating a medium of a symbol, M, in a normalized image
by:

$$M(S_i) = i \cdot |N| / |S|$$

where $S_i$ is the i-th symbol in a string S, $|N|$ is the normalized size, and $|S|$ is
the length of string S.

6. The method of claim 1 wherein said projection of said first string
is generated by projecting said first string onto its closure in a normal
distribution by:

$$c_{S_i,\,M(S_i)+|D|+j} = d_j$$
$$c_{S_i,\,M(S_i)+|D|-j} = d_j \qquad (j = 0, 1, 2, \ldots, |D|)$$

where D is a distributing series, $|D|$ is the distribution size, $d_j$ is the j-th item in
distribution series D, and $c_{S_i,k}$ is the k-th item in symbol $S_i$'s closure.

7. The method of claim 2 wherein said step of generating said
projection of said first string with the use of a cluster table is accomplished by:

$$c_{n,\,M(S_i)+|D|+j} = d_j\,u_{S_i,n}$$
$$c_{n,\,M(S_i)+|D|-j} = d_j\,u_{S_i,n} \qquad (j = 0, 1, 2, \ldots, |D|)$$

where D is a distributing series, $|D|$ is the distribution size, $d_j$ is the j-th item in
distribution series D, $c_{S_i,k}$ is the k-th item in symbol $S_i$'s closure, and $u_{S_i,n}$ is the
weight of symbol n in the cluster whose core is $S_i$.

8. The method of claim 3 wherein said step of generating said
projection of said first string with the use of position weight tables is
accomplished by:

$$c_{n,\,M(S_i)+|D|+j} = d_j\,w_{M(S_i)}$$
$$c_{n,\,M(S_i)+|D|-j} = d_j\,w_{M(S_i)} \qquad (j = 0, 1, 2, \ldots, |D|)$$

where D is a distributing series, $|D|$ is the distribution size, $d_j$ is the j-th item in
distribution series D, $c_{S_i,k}$ is the k-th item in symbol $S_i$'s closure, and $w_{M(S_i)}$ is a
weight on position $M(S_i)$.


9. The method of claim 4 wherein said step of generating said
projection of said first string with the use of a cluster table and a position
weight table is accomplished by:

$$c_{n,\,M(S_i)+|D|+j} = d_j\,w_{M(S_i)}\,u_{S_i,n}$$
$$c_{n,\,M(S_i)+|D|-j} = d_j\,w_{M(S_i)}\,u_{S_i,n} \qquad (j = 0, 1, 2, \ldots, |D|)$$

where D is a distributing series, $|D|$ is the distribution size, $d_j$ is the j-th item in
distribution series D, $c_{S_i,k}$ is the k-th item in symbol $S_i$'s closure, $u_{S_i,n}$ is a cluster
weight, and $w_{M(S_i)}$ is a position weight.

10. Apparatus for comparing a first string of symbols with a second
string of symbols comprising:

normalizing means for normalizing said first string to create a first
normalized string and for normalizing said second string to create a second
normalized string;

projection generating means coupled to said normalizing means for
generating a first projection from said first normalized string and for
generating a second projection from said second normalized string;

comparing means coupled to said projection generating means for
comparing said first projection and said second projection to determine a
degree of similarity of said first and second projections.

11. The apparatus of claim 10 wherein generating said first projection
and said second projection is accomplished with the use of cluster tables.

12. The apparatus of claim 10 wherein generating said first projection
and said second projection is accomplished with the use of position weight
tables.

13. The apparatus of claim 10 wherein generating said first projection
and said second projection is accomplished with the use of cluster tables and
position weight tables.

14. The apparatus of claim 10 wherein normalizing said first string
comprises generating a medium of a symbol, M, in a normalized image by:

$$M(S_i) = i \cdot |N| / |S|$$

where $S_i$ is the i-th symbol in a string S, $|N|$ is the normalized size, and $|S|$ is
the length of string S.


15. The apparatus of claim 10 wherein said projection of said first
string is generated by projecting said first string onto its closure in a normal
distribution by:

$$c_{S_i,\,M(S_i)+|D|+j} = d_j$$
$$c_{S_i,\,M(S_i)+|D|-j} = d_j \qquad (j = 0, 1, 2, \ldots, |D|)$$

where D is a distributing series, $|D|$ is the distribution size, $d_j$ is the j-th item in
distribution series D, and $c_{S_i,k}$ is the k-th item in symbol $S_i$'s closure.

16. The apparatus of claim 11 wherein generating said projection of
said first string with the use of a cluster table is accomplished by:

$$c_{n,\,M(S_i)+|D|+j} = d_j\,u_{S_i,n}$$
$$c_{n,\,M(S_i)+|D|-j} = d_j\,u_{S_i,n} \qquad (j = 0, 1, 2, \ldots, |D|)$$

where D is a distributing series, $|D|$ is the distribution size, $d_j$ is the j-th item in
distribution series D, $c_{S_i,k}$ is the k-th item in symbol $S_i$'s closure, and $u_{S_i,n}$ is a
cluster weight.

17. The apparatus of claim 12 wherein generating said projection of
said first string with the use of position weight tables is accomplished by:

$$c_{n,\,M(S_i)+|D|+j} = d_j\,w_{M(S_i)}$$
$$c_{n,\,M(S_i)+|D|-j} = d_j\,w_{M(S_i)} \qquad (j = 0, 1, 2, \ldots, |D|)$$

where D is a distributing series, $|D|$ is the distribution size, $d_j$ is the j-th item in
distribution series D, $c_{S_i,k}$ is the k-th item in symbol $S_i$'s closure, and $w_{M(S_i)}$ is a
position weight.

18. The apparatus of claim 13 wherein generating said projection of
said first string with the use of a cluster table and a position weight table is
accomplished by:

$$c_{n,\,M(S_i)+|D|+j} = d_j\,w_{M(S_i)}\,u_{S_i,n}$$
$$c_{n,\,M(S_i)+|D|-j} = d_j\,w_{M(S_i)}\,u_{S_i,n} \qquad (j = 0, 1, 2, \ldots, |D|)$$

where D is a distributing series, $|D|$ is the distribution size, $d_j$ is the j-th item in
distribution series D, $c_{S_i,k}$ is the k-th item in symbol $S_i$'s closure, $u_{S_i,n}$ is a cluster
weight, and $w_{M(S_i)}$ is a position weight.
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